
Contiguous PA #424

Merged 9 commits into habana_main from dev/mfylcek/contiguous_pa_main_24_10 on Oct 25, 2024

Conversation

@mfylcek mfylcek commented Oct 24, 2024

Contiguous cache fetching to avoid using the costly gather operation. Requires changes in vllm-hpu-extension (HabanaAI/vllm-hpu-extension#17) to work.

This introduces redundant calculations in the decoding phase, but in all tested cases it improves performance over the entire run (5-12%). For even better performance, cache defragmentation is required. Only compatible with the v2 block manager.
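A minimal sketch of the idea, using an illustrative PyTorch cache layout (tensor names and shapes here are assumptions, not the actual vLLM HPU code): instead of gathering scattered block indices out of the KV cache, read one contiguous slice covering every allocated block, trading redundant reads for a contiguous access pattern.

```python
import torch

# Illustrative shapes, not the real vLLM HPU cache layout.
num_cache_blocks, block_size, head_dim = 1024, 128, 128
kv_cache = torch.randn(num_cache_blocks, block_size, head_dim)

# Blocks actually used by the running sequences (typically scattered).
block_ids = torch.tensor([3, 17, 42, 600])

# Gather-based fetch: reads only the blocks needed, but the gather is costly.
gathered = kv_cache.index_select(0, block_ids)

# Contiguous fetch: one dense slice from block 0 up to the highest used block.
# Blocks in between are read redundantly, but the access is contiguous.
contiguous = kv_cache[: block_ids.max() + 1]
```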

@mfylcek mfylcek changed the base branch from main to habana_main October 24, 2024 13:15

xuechendi commented Oct 24, 2024

Hi @mfylcek, we've been following this branch and just ran a test on Gaudi3 with a static batch size of 128.

test script:

python -m vllm.entrypoints.openai.api_server --port 8080 --model meta-llama/Meta-Llama-3.1-8B-Instruct --tensor-parallel-size 1 --max-num-seqs 128 --disable-log-requests --dtype bfloat16 --block-size 128 --gpu-memory-util 0.9 --num-lookahead-slots 1 --use-v2-block-manager  --max-model-len 4096

# repeat command below 3 times for warming up then get final result
python benchmark_serving.py --backend vllm --model meta-llama/Meta-Llama-3.1-8B-Instruct --dataset-name sonnet --dataset-path ./sonnet.txt --request-rate 512 --num-prompts 56 --port 8080 --sonnet-input-len 1024 --sonnet-output-len 1024 --sonnet-prefix-len 100


xuechendi commented Oct 25, 2024

@mfylcek @michalkuligowski,
our team submitted PR #426, which effectively reduces fragmentation when creating the block_list.

Here is the performance I measured with PR #426:
data deleted

@xuechendi

From observation, after warm-up:

  • the block list with only PR #424 is roughly double-sized, padded with many [-1] entries
  • the block list with PR #424 + PR #426 stays the same size as in the previous run
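A rough illustration of that padding effect (hypothetical helper, not code from either PR): the contiguous path has to cover every block id up to the highest one still in use, filling the gaps with -1, while a defragmented allocation keeps the list at its logical size.

```python
# Hypothetical illustration of block-list padding, not code from PR #424 or #426.
def contiguous_block_list(used_block_ids, pad_id=-1):
    """Cover ids 0..max(used) so the cache can be fetched as one contiguous slice."""
    used = set(used_block_ids)
    return [bid if bid in used else pad_id for bid in range(max(used) + 1)]

fragmented = [0, 2, 5, 9]   # blocks scattered after many alloc/free cycles
compacted = [0, 1, 2, 3]    # the same sequences after defragmentation

print(contiguous_block_list(fragmented))  # [0, -1, 2, -1, -1, 5, -1, -1, -1, 9]
print(contiguous_block_list(compacted))   # [0, 1, 2, 3]
```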

@mfylcek mfylcek added the habana Issues or PRs submitted by Habana Labs label Oct 25, 2024
@michalkuligowski michalkuligowski merged commit 5b7f685 into habana_main Oct 25, 2024
19 checks passed
@michalkuligowski michalkuligowski deleted the dev/mfylcek/contiguous_pa_main_24_10 branch October 25, 2024 12:35
@madamczykhabana madamczykhabana restored the dev/mfylcek/contiguous_pa_main_24_10 branch October 25, 2024 12:42
madamczykhabana added a commit that referenced this pull request Oct 25, 2024
madamczykhabana added a commit that referenced this pull request Oct 25, 2024
@xuechendi xuechendi mentioned this pull request Oct 25, 2024
afierka-intel pushed a commit that referenced this pull request Oct 26, 2024
afierka-intel pushed a commit that referenced this pull request Oct 26, 2024